Climate change is a global challenge jeopardizing humanity’s future. The purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change, and the use of renewable energy such as solar, wind, or hydroelectric energy. The data used in this project come from two different sources, one measuring total carbon dioxide emissions per country over the years and one measuring the proportion of energy use that is renewable per country over the years. Total_CO2_Emissions is in 1000 tons of CO2 and Percent_Consumption_Renewable is the percentage of total energy use produced by renewable sources per country over the years. Because the data is collected on individual countries over the years, this investigation will primarily focus on the Average Emissions and the Average Percent Renewable across all countries in a given year. The averaged data cover 207 distinct countries observed over 29 years. The relationship between Average Emissions and Average Percent Renewable is hypothesized to be negative: if humanity used more renewable energy on average, average CO2 emissions would be lower. Through this investigation, hopefully more can be learned about climate change and potential solutions to this global problem affecting every organism on planet Earth.
2 Materials and Methods
The master data used for analysis in this project contains the yearly averages across all countries from two different data sets, one measuring total carbon dioxide emissions per country over the years and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from an online database titled Gapminder, which has been a reliable provider of data since 2005 and aims to promote global sustainability through easily accessible information.
Before analysis, the data had to be properly cleaned and wrangled. The CO2 data set contained values in the form of 25.3k and 4.9M, where the suffixes denote thousands and millions. The first step in the data cleaning process was to replace these characters with numbers to create a numeric variable on which calculations could be performed. Once the data was all of the same type, it was pivoted into tidy form and the years of interest (1989 – 2017) were selected. Once pivoted, the data sets were joined by year and country to produce a clean master data set used for model fitting, plotting, and analysis. A final decision was made to drop all the NAs in the master set for a few reasons. The missing values typically occurred in the renewable energy percentage across the earlier years measured, and usually in countries with smaller populations. It is unclear whether these NAs reflect a lack of observation or simply the absence of any renewable energy production or consumption. Because averages are being studied and the NAs usually occurred in smaller countries, there was still a large enough sample to average over once those observations were dropped. Other approaches, such as cell-mean imputation, were contemplated but not carried out for fear of introducing bias into the data.
Linear regression, a method that predicts the values of one variable from another by fitting the straight line that minimizes the sum of squared residuals, was used to fit a model and generate predictions. All the data cleaning, modeling, and subsequent analysis were conducted in R.
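The suffix conversion described above can be sketched as follows. This is a minimal illustration in base R rather than the project's exact cleaning code; the helper name `suffix_to_numeric` and the sample values are assumptions made up for the example.

```r
# Hypothetical helper: convert strings such as "25.3k" or "4.9M" to numbers
# by rewriting the suffix as a power of ten that as.numeric() understands.
suffix_to_numeric <- function(x) {
  x <- sub("k$", "e3", x)  # "k" marks thousands
  x <- sub("M$", "e6", x)  # "M" marks millions
  as.numeric(x)
}

# Made-up values in the same style as the raw CO2 columns
example_values <- c("25.3k", "4.9M", "812")
converted <- suffix_to_numeric(example_values)
print(converted)
```

Rewriting the suffix as scientific notation keeps the conversion to one pass per suffix and leaves values without a suffix unchanged.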
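To make the least-squares idea concrete, the sketch below computes the slope and intercept from the closed-form formulas and compares them with `lm()`. The five data points are invented for illustration and are not taken from the Gapminder data.

```r
# Toy data (made up for illustration only)
x <- c(10, 12, 14, 16, 18)
y <- c(520, 480, 450, 410, 370)

# Closed-form least-squares estimates for a single explanatory variable
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# lm() minimizes the same sum of squared residuals
fit <- lm(y ~ x)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both approaches should agree up to floating-point error, since `lm()` solves the same minimization problem.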
Code
datatable(head(master, n = 50), caption = 'Interactive Preview of Data Set')
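3 Analysis and Discussion of Model
Code
energy_emissions_model <- lm(`Average Emissions` ~ `Average Percent Renewable`, data = master)
summary(energy_emissions_model)
tidy(energy_emissions_model)
Regression Equation:
\[ \hat{y} = 693309.6 - 17770.2 \times \text{Average Percent Renewable} \]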
Above is the simple linear regression equation based on the model predicting the response variable, Average Emissions (\(\hat{y}\)), from the explanatory variable, Average Percent Renewable. The coefficient on Average Percent Renewable is strongly negative at -17770.2, meaning that for each one-percentage-point increase in Average Percent Renewable, the predicted average CO2 emissions decrease by 17770.2 thousand tons. The intercept of 693309.6 means that when Average Percent Renewable is zero, so that no renewable energy is being used at all on average, the predicted Average Emissions would be 693309.6 thousand tons.
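As a worked example of reading the equation, the sketch below plugs two values of Average Percent Renewable into the reported coefficients. The helper name `predict_emissions` and the choice of 0 and 20 percent are assumptions for illustration; values outside the observed range of the data are extrapolations and should be read with caution.

```r
# Coefficients as reported in the regression equation
intercept <- 693309.6
slope <- -17770.2

# Hypothetical helper: predicted average emissions in 1000 tons of CO2
predict_emissions <- function(pct_renewable) {
  intercept + slope * pct_renewable
}

at_zero <- predict_emissions(0)     # equals the constant term
at_twenty <- predict_emissions(20)  # 693309.6 - 17770.2 * 20
print(c(at_zero = at_zero, at_twenty = at_twenty))
```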
Code
# Plot 1
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Average Renewable Energy Consumption (%)",
       y = "",
       title = "Relationship between Renewable Energy Usage and CO2 Emissions",
       subtitle = "Average CO2 Emissions (1000 tonnes)") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        plot.subtitle = element_text(size = 10),
        axis.title.x = element_text(size = 10))
raw_graph
The graph above shows the relationship between Average Emissions and Average Percent Renewable. The scatterplot illustrates a negative linear relationship, where the points lie relatively close to the plotted regression line with little deviation or noise. There are few to no unusual observations. This is consistent with the hypothesis that as Average Percent Renewable increases, Average Emissions decreases substantially.
Code
co2_by_year_graph <- plot_ly(master, x = ~Year, y = ~`Average Emissions`,
                             type = 'scatter', mode = 'markers',
                             marker = list(color = 'red'))
# xaxis and yaxis are separate arguments to layout()
co2_by_year_graph <- co2_by_year_graph |>
  layout(title = 'Average Emissions Over Time',
         yaxis = list(title = 'Average Emissions (1000 tons)', titlefont = list(size = 14)),
         xaxis = list(title = 'Year', titlefont = list(size = 14)))
energy_by_year_graph <- plot_ly(master, x = ~Year, y = ~`Average Percent Renewable`,
                                type = 'scatter', mode = 'markers',
                                marker = list(color = 'green'))
energy_by_year_graph <- energy_by_year_graph |>
  layout(title = 'Average Percent Renewable Over Time',
         yaxis = list(title = 'Average Percent Renewable', titlefont = list(size = 14)),
         xaxis = list(title = 'Year', titlefont = list(size = 14)))
co2_by_year_graph
Code
energy_by_year_graph
As shown by the two plots of Average Emissions and Average Percent Renewable over time, the two variables move in opposite directions, but perhaps not as expected. Average Percent Renewable is decreasing over the years while Average Emissions is increasing, which still produces a negative relationship. It makes sense that Average Emissions is increasing over time because of rapid population growth and growing demand for production, but Average Percent Renewable has, surprisingly, been declining in recent years. While this likely has something to do with varying definitions of renewable energy, for example whether or not nuclear energy is truly renewable, it is surprising that renewable energy use has not grown as technology has developed. This suggests that humanity needs to increase renewable energy production and usage on average in order to reduce its carbon footprint and preserve nature for future generations.
Model Fit:
Code
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit', digits = 3, format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
Model Fit

| Variance of Fitted | Variance of Residuals | Variance of Average CO2 Emissions |
|---|---|---|
| 415,649,404 | 46,532,518 | 462,181,923 |
The proportion of variability in the response values accounted for by the model, \(R^{2}\), was very large at about 89.93 percent. This suggests a good quality model, where about 90% of the variation in the response, Average Emissions, is explained by the explanatory variable, Average Percent Renewable. A high proportion of the variability in the response is therefore accounted for by the linear model, although a high \(R^{2}\) on its own does not rule out other factors influencing emissions.
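The \(R^{2}\) value can be recovered directly from the variances in the Model Fit table. The short sketch below does the arithmetic both ways; the variable names are chosen for this example only.

```r
# Variances copied from the Model Fit table
var_fitted <- 415649404
var_resid <- 46532518
var_total <- 462181923

# R squared as explained variance over total variance
r2_from_fitted <- var_fitted / var_total

# Equivalent form: one minus unexplained variance over total variance
r2_from_resid <- 1 - var_resid / var_total

print(round(c(r2_from_fitted, r2_from_resid), 4))
```

The two forms agree because the variance of the fitted values and the variance of the residuals add up to the total variance, up to rounding in the table.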
Code
energy_emissions_model |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(y = '',
       subtitle = 'Residuals',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
Simulation:
Code
# Add normally distributed noise to each element of x
noise <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean, sd)
}
Code
# Simulate one set of responses: fitted values plus noise with the residual standard error
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)
sim_response <- tibble(sim_emissions = noise(master_predict, sd = master_sigma))
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
Code
# Combine observed and simulated responses
sim_data <- master |>
  filter(!is.na(`Average Emissions`), !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`, `Average Percent Renewable`) |>
  bind_cols(sim_response)
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Observed Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
sim_master_graph <- sim_data |>
  ggplot(aes(x = `Average Percent Renewable`, y = sim_emissions)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Simulated Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Simulated Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
# Side-by-side layout with `+` requires the patchwork package to be loaded
raw_graph + sim_master_graph
Code
# Check the similarity between the simulated data and observed data
sim_data |>
  ggplot(aes(x = sim_emissions, y = `Average Emissions`)) +
  geom_point() +
  labs(x = "Simulated Avg CO2 Emissions/1000 tonnes",
       y = "",
       title = "The Similarity between Observed Data and Simulated Data",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  geom_abline(slope = 1, intercept = 0, color = "steelblue", linetype = "dashed", lwd = 1.5) +
  theme_bw()
p value
Code
# Check how well the simulated data fit the observed data, especially the p-value and R squared.
p_value <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(p.value) |>
  pull()
p_value
value
0.0000000000002696793
r square
Code
# Get the R squared for the simulated data regressed against the observed data.
sim_r2 <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(r.squared) |>
  pull()
sim_r2
[1] 0.8210543
Code
# Create 1000 simulated data sets
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, sd = master_sigma)))
head(sims)
Code
# Clean the column names
colnames(sims) <- colnames(sims) |>
  str_replace(pattern = "\\.\\.\\.", replacement = "_")
head(sims)
Code
# Bind the 1000 simulated data sets and the observed data together
sims <- master |>
  filter(!is.na(`Average Emissions`), !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`) |>
  bind_cols(sims)
head(sims)
Code
# Map over the simulated columns to get 1000 simulated R squared values.
# The observed column is excluded so it is not regressed on itself.
sim_r_sq <- sims |>
  select(-`Average Emissions`) |>
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |>
  map(glance) |>
  map_dbl(~ .x$r.squared)
head(sim_r_sq)
Code
# Plot the distribution of the 1000 simulated R squared values.
tibble(sims = sim_r_sq) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated" ~ R^2),
       y = "",
       subtitle = "Number of Simulated Models")
# The distribution of these values tells us whether the assumed model does a good job of
# producing data similar to what was observed. If it does, we would expect values near 1.
# The simulated R squared for our model is around 0.8.
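The simulation check above can be condensed into one self-contained sketch. It uses invented toy data in place of `master` and base R's `replicate()` in place of the `purrr` pipeline, so the variable names and numbers here are assumptions for illustration and will not reproduce the values reported in this document.

```r
set.seed(1)  # make the illustration reproducible

# Toy data standing in for the 29 yearly averages (made up)
x <- seq(10, 20, length.out = 29)
y <- 700 - 18 * x + rnorm(29, sd = 10)
fit <- lm(y ~ x)

# Same idea as the noise() helper: add normal errors to the fitted values
add_noise <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean, sd)
}

# Simulate 1000 response vectors from the fitted model and record how well
# each one lines up with the observed response
sim_r2 <- replicate(1000, {
  y_sim <- add_noise(fitted(fit), sd = sigma(fit))
  summary(lm(y ~ y_sim))$r.squared
})

print(summary(sim_r2))
```

If the assumed model generates data resembling the observations, the simulated \(R^{2}\) values should cluster toward the upper end of the 0 to 1 range.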
4 References
https://www.gapminder.org/data/
=======
---title: "Investigating CO2 Emissions and Renewable Energy Usage"author: "Noel Lopez, Patrick George, Riley Svensson, Ningjing Hua"format: html: self-contained: true code-tools: true toc: true number-sections: true code-fold: trueeditor: sourceexecute: error: true echo: true message: false warning: false---```{r setup}library(plotly)library(tidyverse)library(broom)library(knitr)library(DT)library(kableExtra)library(gridExtra)library(patchwork)library(RColorBrewer)options(scipen =99999)energy <- readxl::read_xlsx(here::here("renewable_energy.xlsx"))co2 <- readxl::read_xlsx(here::here("co2.xlsx"))``````{r clean data}#| output: falseconvert_to_numeric <-function(x) { x <-str_replace_all(x, "k", "e3") x <-str_replace_all(x, "M", "e6")return(as.numeric(x))}co2_clean <- co2 |>select(country, `1989`:`2017`) |>mutate(across(.cols =`1989`:`2017`, ~convert_to_numeric(.))) |>pivot_longer(cols =!country,names_to ="year",values_to ="Total_CO2_Emissions")energy_clean <- energy |>pivot_longer(cols =!country,names_to ="year",values_to ="Percent_Consumption_Renewable")joined <-inner_join(co2_clean, energy_clean) |>drop_na()joined |>distinct(year) |>count()joined |>distinct(country) |>count()master <- joined|>rename(Year = year) |>group_by(Year) |>mutate(Year =as.numeric(Year)) |>summarize(`Average Emissions`=mean(`Total_CO2_Emissions`),`Average Percent Renewable`=mean(`Percent_Consumption_Renewable`))```# IntroductionClimate change is a global challenge jeopardizing humanity's future. This purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change creating greenhouse gasses, and the use of renewable energy such as solar, wind, or hydroelectric energy. The data to be used in this project incorporates data from two different sources, one measuring total carbon dioxide emissions per country over the years and one measuring proportion of energy use which is renewable over the years. 
`Total_CO2_Emissions` is in 1000 tons of CO2 and `Percent_Consumption_Renewable` is the percentage of total energy use produced by renewable sources per country over the years. Because the data is collected on individual countries over the years, this investigation will primarily focus on the `Average Emissions` and the `Average Percent Renewable` across all countries in a specific year. The data from which the averages are computed covers 207 distinct countries observed over 29 years. The relationship between `Average Emissions` and `Average Percent Renewable` is hypothesized to be negative: as humanity uses more renewable energy on average, average CO2 emissions should be lower. Through this investigation, hopefully more can be learned about climate change and potential solutions to this global problem affecting every organism on planet Earth.

# Materials and Methods

The master data used for analysis in this project contains the averages over all countries for each year, computed from two different data sets, one measuring the total carbon dioxide emissions per country over the years and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from an online database titled *Gapminder*, which has been a reliable provider of data since 2005 and aims to promote global sustainability through easily accessible information.

Before analysis, the data had to be properly cleaned and wrangled. The CO2 data set contained values in the form of 25.3k and 4.9M, where the suffixes signal units of thousands and millions. The first step in the data cleaning process was to replace these characters with numbers to create a numeric variable on which calculations could be performed. Once the data was all of the same type, it was pivoted into tidy form and the years of interest (1989 to 2017) were selected. 
Once pivoted, the data sets were joined by year and country to produce a clean master data set to be used for model fitting, plotting, and analysis. A final decision was made to drop all the NAs in the master set, for a few reasons. The missing values typically occurred in the `Average Percent Renewable` variable across the earlier years measured and usually in countries with smaller populations. It is unclear whether these NAs appear in the data due to lack of observation or simply because there was no renewable energy production or consumption. Because the averages are being studied and the NAs usually occurred in smaller countries, there was still a large enough sample size to average over once those observations were dropped. Other approaches such as cell-mean imputation were contemplated but not conducted due to fears of introducing bias into the data.

Linear regression, a method that predicts the values of one variable from another by fitting the straight line that minimizes the sum of squared residuals, was used to fit the model and make predictions. All the data cleaning, modeling, and subsequent analysis was conducted using R code.

```{r}
datatable(head(master, n = 50),
          caption = 'Interactive Preview of Data Set')
```

# Analysis and Discussion of Model

```{r}
#| output: false
energy_emissions_model <- lm(`Average Emissions` ~ `Average Percent Renewable`,
                             data = master)
summary(energy_emissions_model)
tidy(energy_emissions_model)
```

**Regression Equation:**

$$ \hat{y} = 693309.6 - 17770.2 \times \text{Average Percent Renewable} $$

Above is the simple linear regression equation based on the model predicting the response variable, `Average Emissions` ($\hat{y}$), from the explanatory variable, `Average Percent Renewable`. 
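
As a rough illustration of how the fitted equation is used, consider a hypothetical year in which the `Average Percent Renewable` is 20 percent. This value is an assumption chosen only for the arithmetic and is not taken from the data. Plugging it into the equation gives:

$$ \hat{y} = 693309.6 - 17770.2 \times 20 = 693309.6 - 355404.0 = 337905.6 $$

That is, the model would predict `Average Emissions` of roughly 337,906 thousand tons for such a year. A prediction like this is only meaningful if the chosen percentage lies within the range of values actually observed in the master data set.
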
The coefficient on `Average Percent Renewable` is strongly negative at -17770.2, meaning that for each one percentage point increase in the `Average Percent Renewable`, the predicted average amount of CO2 emissions decreases by 17770.2 thousand tons. The intercept of 693309.6 means that when the `Average Percent Renewable` is zero, such that no renewable energy is being used at all on average, the predicted `Average Emissions` would be 693309.6 thousand tons.

```{r}
#| fig-align: center
# Plot 1: observed yearly averages with the fitted regression line
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Average Renewable Energy Consumption (%)",
       y = "",
       title = "Relationship between Renewable Energy Usage and CO2 Emissions",
       subtitle = "Average CO2 Emissions (1000 tonnes)") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        plot.subtitle = element_text(size = 10),
        axis.title.x = element_text(size = 10))
raw_graph
```

The graph above demonstrates the relationship between `Average Emissions` and `Average Percent Renewable`. The points show a negative linear relationship and lie relatively close to the plotted regression line, with little deviation and noise. There are few to no unusual observations. 
This illustration is consistent with the hypothesis that as the `Average Percent Renewable` increases, the `Average Emissions` decreases at a substantial rate.

```{r}
#| fig-align: center
co2_by_year_graph <- plot_ly(master,
                             x = ~Year,
                             y = ~`Average Emissions`,
                             type = 'scatter',
                             mode = 'markers',
                             marker = list(color = 'red'))

# xaxis and yaxis are separate layout arguments
co2_by_year_graph <- co2_by_year_graph |>
  layout(title = 'Average Emissions Over Time',
         yaxis = list(title = 'Average Emissions (1000 tons)',
                      titlefont = list(size = 14)),
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))

energy_by_year_graph <- plot_ly(master,
                                x = ~Year,
                                y = ~`Average Percent Renewable`,
                                type = 'scatter',
                                mode = 'markers',
                                marker = list(color = 'green'))

energy_by_year_graph <- energy_by_year_graph |>
  layout(title = 'Average Percent Renewable Over Time',
         yaxis = list(title = 'Average Percent Renewable',
                      titlefont = list(size = 14)),
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))

co2_by_year_graph
energy_by_year_graph
```

As shown by the two plots of `Average Emissions` and `Average Percent Renewable` over time, the relationship is negative, but perhaps not in the way expected. `Average Percent Renewable` is decreasing over the years while `Average Emissions` is increasing, which still illustrates a negative relationship. It makes sense that `Average Emissions` is increasing over time, because of extreme population growth and growing demand for production, but `Average Percent Renewable` has, surprisingly, been declining in recent years. While this likely has something to do with the varying definition of renewable energy, for example whether or not nuclear energy counts as renewable, it is surprising that as technology develops renewable energy use does not. 
This signifies that humanity needs to increase renewable energy production and usage on average in order to reduce its carbon footprint and preserve nature for future generations.

**Model Fit:**

```{r}
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit',
        digits = 3,
        format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
```

The proportion of variability in the response values that was accounted for by the model, $R^{2}$, was very large at about 89.93 percent. This suggests a good quality model, where about 90% of the variation in the response, `Average Emissions`, is explained by the explanatory variable, `Average Percent Renewable`. Because such a high proportion of the variability in the response is accounted for by the linear model, it appears that there are not many other large factors influencing the yearly average emissions.

```{r}
#| fig-align: center
energy_emissions_model |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(y = '',
       subtitle = 'Residuals',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
```

**Simulation:**

```{r}
# Add normally distributed noise to a vector of values
noise <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean, sd)
}
```

```{r}
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)

# One simulated response: fitted values plus noise with the residual SD
sim_response <- tibble(sim_emissions = noise(master_predict, sd = master_sigma))

raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
```

```{r}
obs_emissions <- master |>
  ggplot(aes(x = `Average Emissions`)) +
  geom_histogram(binwidth = 3000) +
  labs(x = "Observed Emissions",
       y = "",
       subtitle = "Count") +
  theme_bw()

sim_emissions_graph <- sim_response |>
  ggplot(aes(x = sim_emissions)) +
  geom_histogram(binwidth = 3500) +
  labs(x = "Simulated Emissions",
       y = "",
       subtitle = "Count") +
  theme_bw()

obs_emissions + sim_emissions_graph
```

```{r}
sim_data <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`, `Average Percent Renewable`) |>
  bind_cols(sim_response)

raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Observed Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")

sim_master_graph <- sim_data |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = sim_emissions)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Simulated Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Simulated Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")

raw_graph + sim_master_graph
```

```{r}
# Check the similarity between the simulated data and the observed data
sim_data |>
  ggplot(aes(x = sim_emissions,
             y = `Average Emissions`)) +
  geom_point() +
  labs(x = "Simulated Avg CO2 Emissions/1000 tonnes",
       y = "",
       title = "The Similarity between Observed Data and Simulated Data",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  geom_abline(slope = 1,
              intercept = 0,
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw()
```

P-value

```{r}
# Check how well the simulated data fit the observed data, especially the
# p-value and R squared
p_value <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(p.value) |>
  pull()
p_value
```

R squared

```{r}
# Get the R squared for the simulated data fit to the observed data
sim_r2 <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(r.squared) |>
  pull()
sim_r2
```

```{r}
# Create 1000 simulated datasets
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, sd = master_sigma)))
```

```{r}
# Clean the column names
colnames(sims) <- colnames(sims) |>
  str_replace(pattern = "\\.\\.\\.", replacement = "_")
```

Original Average Emissions data with 1000 simulated datasets

```{r}
# Bind the 1000 simulated datasets and the observed dataset together
sims <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`) |>
  bind_cols(sims)
head(sims)
```

R squared for 1000 simulated datasets

```{r}
# Map over the columns to get the 1000 simulated R squared values
sim_r_sq <- sims |>
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |>
  map(glance) |>
  map_dbl(~ .x$r.squared)
head(sim_r_sq)
```

Simulated R squared distribution

```{r}
# Plot the distribution of the 1000 simulated R squared values
tibble(sims = sim_r_sq) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated" ~ R^2),
       y = "",
       subtitle = "Number of Simulated Models")
# The distribution of these values tells us whether the assumed model does a
# good job of producing data similar to what was observed. If it does, we
# would expect values near 1.
# Our model's simulated R squared is around 0.8.
```

# References

https://www.gapminder.org/data/

https://plotly.com/r/line-and-scatter/#custom-color-scales
>>>>>>> 5b7bbb9045f9cf8201a690ba8e5c5c84ba477ebd